{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Movie recommendation system with Qdrant space vectors\n",
    "\n",
    "This notebook is a simple example of how to use Qdrant to build a movie recommendation system.\n",
    "We will use the MovieLens dataset and Qdrant to build a simple recommendation system.\n",
    "\n",
    "## How it works\n",
    "\n",
    "MovieLens dataset contains a list of movies and ratings given by users. We will use this data to build a recommendation system.\n",
    "\n",
    "Our recommendation system will use an approach called **collaborative filtering**.\n",
    "\n",
    "The idea behind collaborative filtering is that if two users have similar tastes, then they will like similar movies.\n",
    "We will use this idea to find the most similar users to our own ratings and see what movies these similar users liked, which we haven't seen yet.\n",
    "\n",
    "\n",
    "1. We will represent each user's ratings as a vector in a sparse high-dimensional space.\n",
    "2. We will use Qdrant to index these vectors.\n",
    "3. We will use Qdrant to find the most similar users to our own ratings.\n",
    "4. We will see what movies these similar users liked, which we haven't seen yet.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install qdrant-client pandas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Download and unzip the dataset\n",
    "\n",
    "!mkdir -p data\n",
    "!wget https://files.grouplens.org/datasets/movielens/ml-1m.zip\n",
    "!unzip ml-1m.zip -d data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 26,
   "metadata": {},
   "outputs": [],
   "source": [
    "from qdrant_client import QdrantClient, models\n",
    "import pandas as pd"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 27,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>user_id</th>\n",
       "      <th>gender</th>\n",
       "      <th>age</th>\n",
       "      <th>occupation</th>\n",
       "      <th>zip</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>F</td>\n",
       "      <td>1</td>\n",
       "      <td>10</td>\n",
       "      <td>48067</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>M</td>\n",
       "      <td>56</td>\n",
       "      <td>16</td>\n",
       "      <td>70072</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>M</td>\n",
       "      <td>25</td>\n",
       "      <td>15</td>\n",
       "      <td>55117</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>M</td>\n",
       "      <td>45</td>\n",
       "      <td>7</td>\n",
       "      <td>02460</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>M</td>\n",
       "      <td>25</td>\n",
       "      <td>20</td>\n",
       "      <td>55455</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6035</th>\n",
       "      <td>6036</td>\n",
       "      <td>F</td>\n",
       "      <td>25</td>\n",
       "      <td>15</td>\n",
       "      <td>32603</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6036</th>\n",
       "      <td>6037</td>\n",
       "      <td>F</td>\n",
       "      <td>45</td>\n",
       "      <td>1</td>\n",
       "      <td>76006</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6037</th>\n",
       "      <td>6038</td>\n",
       "      <td>F</td>\n",
       "      <td>56</td>\n",
       "      <td>1</td>\n",
       "      <td>14706</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6038</th>\n",
       "      <td>6039</td>\n",
       "      <td>F</td>\n",
       "      <td>45</td>\n",
       "      <td>0</td>\n",
       "      <td>01060</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>6039</th>\n",
       "      <td>6040</td>\n",
       "      <td>M</td>\n",
       "      <td>25</td>\n",
       "      <td>6</td>\n",
       "      <td>11106</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>6040 rows × 5 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      user_id gender  age  occupation    zip\n",
       "0           1      F    1          10  48067\n",
       "1           2      M   56          16  70072\n",
       "2           3      M   25          15  55117\n",
       "3           4      M   45           7  02460\n",
       "4           5      M   25          20  55455\n",
       "...       ...    ...  ...         ...    ...\n",
       "6035     6036      F   25          15  32603\n",
       "6036     6037      F   45           1  76006\n",
       "6037     6038      F   56           1  14706\n",
       "6038     6039      F   45           0  01060\n",
       "6039     6040      M   25           6  11106\n",
       "\n",
       "[6040 rows x 5 columns]"
      ]
     },
     "execution_count": 27,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "users = pd.read_csv(\n",
    "    './data/ml-1m/users.dat',\n",
    "    sep='::',\n",
    "    names=[\n",
    "        'user_id',\n",
    "        'gender',\n",
    "        'age',\n",
    "        'occupation',\n",
    "        'zip'\n",
    "    ],\n",
    "    engine='python'\n",
    ")\n",
    "users"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 28,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>movie_id</th>\n",
       "      <th>title</th>\n",
       "      <th>genres</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>1</td>\n",
       "      <td>Toy Story (1995)</td>\n",
       "      <td>Animation|Children's|Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2</td>\n",
       "      <td>Jumanji (1995)</td>\n",
       "      <td>Adventure|Children's|Fantasy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>3</td>\n",
       "      <td>Grumpier Old Men (1995)</td>\n",
       "      <td>Comedy|Romance</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>4</td>\n",
       "      <td>Waiting to Exhale (1995)</td>\n",
       "      <td>Comedy|Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>5</td>\n",
       "      <td>Father of the Bride Part II (1995)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>...</th>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "      <td>...</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3878</th>\n",
       "      <td>3948</td>\n",
       "      <td>Meet the Parents (2000)</td>\n",
       "      <td>Comedy</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3879</th>\n",
       "      <td>3949</td>\n",
       "      <td>Requiem for a Dream (2000)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3880</th>\n",
       "      <td>3950</td>\n",
       "      <td>Tigerland (2000)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3881</th>\n",
       "      <td>3951</td>\n",
       "      <td>Two Family House (2000)</td>\n",
       "      <td>Drama</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3882</th>\n",
       "      <td>3952</td>\n",
       "      <td>Contender, The (2000)</td>\n",
       "      <td>Drama|Thriller</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>3883 rows × 3 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "      movie_id                               title  \\\n",
       "0            1                    Toy Story (1995)   \n",
       "1            2                      Jumanji (1995)   \n",
       "2            3             Grumpier Old Men (1995)   \n",
       "3            4            Waiting to Exhale (1995)   \n",
       "4            5  Father of the Bride Part II (1995)   \n",
       "...        ...                                 ...   \n",
       "3878      3948             Meet the Parents (2000)   \n",
       "3879      3949          Requiem for a Dream (2000)   \n",
       "3880      3950                    Tigerland (2000)   \n",
       "3881      3951             Two Family House (2000)   \n",
       "3882      3952               Contender, The (2000)   \n",
       "\n",
       "                            genres  \n",
       "0      Animation|Children's|Comedy  \n",
       "1     Adventure|Children's|Fantasy  \n",
       "2                   Comedy|Romance  \n",
       "3                     Comedy|Drama  \n",
       "4                           Comedy  \n",
       "...                            ...  \n",
       "3878                        Comedy  \n",
       "3879                         Drama  \n",
       "3880                         Drama  \n",
       "3881                         Drama  \n",
       "3882                Drama|Thriller  \n",
       "\n",
       "[3883 rows x 3 columns]"
      ]
     },
     "execution_count": 28,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "movies = pd.read_csv(\n",
    "    './data/ml-1m/movies.dat',\n",
    "    sep='::',\n",
    "    names=[\n",
    "        'movie_id',\n",
    "        'title',\n",
    "        'genres'\n",
    "    ],\n",
    "    engine='python',\n",
    "    encoding='latin-1'\n",
    ")\n",
    "movies"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 29,
   "metadata": {},
   "outputs": [],
   "source": [
    "ratings = pd.read_csv(\n",
    "    './data/ml-1m/ratings.dat',\n",
    "    sep='::',\n",
    "    names=[\n",
    "        'user_id',\n",
    "        'movie_id',\n",
    "        'rating',\n",
    "        'timestamp'\n",
    "    ],\n",
    "    engine='python'\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 30,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Normalize ratings\n",
    "\n",
    "# Sparse vectors can use advantage of negative values, so we can normalize ratings to have mean 0 and std 1\n",
    "# In this scenario we can take into account movies that we don't like\n",
    "\n",
    "ratings.rating = (ratings.rating - ratings.rating.mean()) / ratings.rating.std()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 31,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Convert ratings to sparse vectors\n",
    "\n",
    "from collections import defaultdict\n",
    "\n",
    "user_sparse_vectors = defaultdict(lambda: {\n",
    "    \"values\": [],\n",
    "    \"indices\": []\n",
    "})\n",
    "\n",
    "for row in ratings.itertuples():\n",
    "    user_sparse_vectors[row.user_id][\"values\"].append(row.rating)\n",
    "    user_sparse_vectors[row.user_id][\"indices\"].append(row.movie_id)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 32,
   "metadata": {},
   "outputs": [],
   "source": [
    "# For this small dataset we can use in-memory Qdrant\n",
    "# But for production we recommend to use server-based version\n",
    "\n",
    "qdrant = QdrantClient(\":memory:\") # or QdrantClient(\"http://localhost:6333\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 33,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "True"
      ]
     },
     "execution_count": 33,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Create collection with configured sparse vectors\n",
    "# Sparse vectors don't require to specify dimension, because it's extracted from the data automatically\n",
    "\n",
    "qdrant.create_collection(\n",
    "    \"movielens\",\n",
    "    vectors_config={},\n",
    "    sparse_vectors_config={\n",
    "        \"ratings\": models.SparseVectorParams()\n",
    "    }\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 34,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Upload all user's votes as sparse vectors\n",
    "\n",
    "def data_generator():\n",
    "    for user in users.itertuples():\n",
    "        yield models.PointStruct(\n",
    "            id=user.user_id,\n",
    "            vector={\n",
    "                \"ratings\": user_sparse_vectors[user.user_id]\n",
    "            },\n",
    "            payload=user._asdict()\n",
    "        )\n",
    "\n",
    "# This will do lazy upload of the data\n",
    "qdrant.upload_points(\n",
    "    \"movielens\",\n",
    "    data_generator()\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Let's try to recommend something for ourselves\n",
    "\n",
    "#  1 - like\n",
    "# -1 - dislike\n",
    "\n",
    "# Search with \n",
    "# movies[movies.title.str.contains(\"Matrix\", case=False)]\n",
    "\n",
    "my_ratings = { \n",
    "    2571: 1,  # Matrix\n",
    "    329: 1,   # Star Trek\n",
    "    260: 1,   # Star Wars\n",
    "    2288: -1, # The Thing\n",
    "    1: 1,     # Toy Story\n",
    "    1721: -1, # Titanic\n",
    "    296: -1,  # Pulp Fiction\n",
    "    356: 1,   # Forrest Gump\n",
    "    2116: 1,  # Lord of the Rings\n",
    "    1291: -1, # Indiana Jones\n",
    "    1036: -1  # Die Hard\n",
    "}\n",
    "\n",
    "inverse_ratings = {k: -v for k, v in my_ratings.items()}\n",
    "\n",
    "def to_vector(ratings):\n",
    "    vector = models.SparseVector(\n",
    "        values=[],\n",
    "        indices=[]\n",
    "    )\n",
    "    for movie_id, rating in ratings.items():\n",
    "        vector.values.append(rating)\n",
    "        vector.indices.append(movie_id)\n",
    "    return vector\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Find users with similar taste\n",
    "\n",
    "results = qdrant.search(\n",
    "    \"movielens\",\n",
    "    query_vector=models.NamedSparseVector(\n",
    "        name=\"ratings\",\n",
    "        vector=to_vector(my_ratings)\n",
    "    ),\n",
    "    with_vectors=True, # We will use those to find new movies\n",
    "    limit=20\n",
    ")\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 37,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Calculate how frequently each movie is found in similar users' ratings\n",
    "\n",
    "def results_to_scores(results):\n",
    "    movie_scores = defaultdict(lambda: 0)\n",
    "\n",
    "    for user in results:\n",
    "        user_scores = user.vector['ratings']\n",
    "        for idx, rating in zip(user_scores.indices, user_scores.values):\n",
    "            if idx in my_ratings:\n",
    "                continue\n",
    "            movie_scores[idx] += rating\n",
    "\n",
    "    return movie_scores"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 38,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Star Wars: Episode V - The Empire Strikes Back (1980) 20.023877887283938\n",
      "Star Wars: Episode VI - Return of the Jedi (1983) 16.44318377549194\n",
      "Princess Bride, The (1987) 15.84006760423755\n",
      "Raiders of the Lost Ark (1981) 14.94489407628955\n",
      "Sixth Sense, The (1999) 14.570321651488953\n"
     ]
    }
   ],
   "source": [
    "# Sort movies by score and print top 5\n",
    "\n",
    "movie_scores = results_to_scores(results)\n",
    "top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True)\n",
    "\n",
    "for movie_id, score in top_movies[:5]:\n",
    "    print(movies[movies.movie_id == movie_id].title.values[0], score)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 39,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Princess Bride, The (1987) 16.214640029038147\n",
      "Star Wars: Episode V - The Empire Strikes Back (1980) 14.652836719595939\n",
      "Blade Runner (1982) 13.52911944519415\n",
      "Usual Suspects, The (1995) 13.446604377087162\n",
      "Godfather, The (1972) 13.300575698740357\n"
     ]
    }
   ],
   "source": [
    "# Find users with similar taste, but only within my age group\n",
    "# We can also filter by other fields, like `gender`, `occupation`, etc.\n",
    "\n",
    "results = qdrant.search(\n",
    "    \"movielens\",\n",
    "    query_vector=models.NamedSparseVector(\n",
    "        name=\"ratings\",\n",
    "        vector=to_vector(my_ratings)\n",
    "    ),\n",
    "    query_filter=models.Filter(\n",
    "        must=[\n",
    "            models.FieldCondition(\n",
    "                key=\"age\",\n",
    "                match=models.MatchValue(value=25)\n",
    "            )\n",
    "        ]\n",
    "    ),\n",
    "    with_vectors=True,\n",
    "    limit=20\n",
    ")\n",
    "\n",
    "movie_scores = results_to_scores(results)\n",
    "top_movies = sorted(movie_scores.items(), key=lambda x: x[1], reverse=True)\n",
    "\n",
    "for movie_id, score in top_movies[:5]:\n",
    "    print(movies[movies.movie_id == movie_id].title.values[0], score)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
